
59 research outputs found

Computing Low-Rank Approximation of a Dense Matrix on Multicore CPUs with a GPU and Its Application to Solving a Hierarchically Semiseparable Linear System of Equations

Author: Dongarra Jack
Tomov Stanimire
Yamazaki Ichitaro
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2015

Low-rank matrices arise in many scientific and engineering computations. Both the computational and storage costs of manipulating such matrices may be reduced by taking advantage of their low-rank properties. To compute a low-rank approximation of a dense matrix, in this paper, we study the performance of QR factorization with column pivoting or with restricted pivoting on multicore CPUs with a GPU. We first propose several techniques to reduce the postprocessing time, which is required for restricted pivoting, on a modern CPU. We then examine the potential of using a GPU to accelerate the factorization process with both column and restricted pivoting. Our performance results on two eight-core Intel Sandy Bridge CPUs with one NVIDIA Kepler GPU demonstrate that using the GPU, the factorization time can be reduced by a factor of more than two. In addition, to study the performance of our implementations in practice, we integrate them into the recently developed software StruMF, which algebraically exploits such low-rank structures for solving a general sparse linear system of equations. Our performance results for solving Poisson's equations demonstrate that the proposed techniques can significantly reduce the preconditioner construction time of StruMF on the CPUs, and the construction time can be further reduced by 10%–50% using the GPU.
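As a rough illustration of the column-pivoting idea in the abstract above, the following is a minimal NumPy sketch of a truncated QR factorization with greedy column pivoting. It is not the paper's CPU/GPU implementation and does not cover restricted pivoting; the function name `qrcp_low_rank` and the Gram-Schmidt formulation are assumptions made for this example.

```python
import numpy as np

def qrcp_low_rank(A, k):
    """Rank-k approximation A[:, perm] ~= Q @ R using greedy column pivoting.

    Hypothetical helper for illustration: a modified Gram-Schmidt variant,
    assuming the numerical rank of A is at least k (no zero pivots).
    """
    W = np.array(A, dtype=float)      # working copy, deflated in place
    m, n = W.shape
    perm = np.arange(n)
    Q = np.zeros((m, k))
    R = np.zeros((k, n))
    for j in range(k):
        # Pivot: move the remaining column with the largest norm to position j.
        # (Norms are recomputed each step for clarity; production codes
        # update them incrementally instead.)
        norms = np.linalg.norm(W[:, j:], axis=0)
        p = j + int(np.argmax(norms))
        W[:, [j, p]] = W[:, [p, j]]
        R[:, [j, p]] = R[:, [p, j]]
        perm[[j, p]] = perm[[p, j]]
        # Normalize the pivot column, then project it out of the trailing columns.
        R[j, j] = np.linalg.norm(W[:, j])
        Q[:, j] = W[:, j] / R[j, j]
        R[j, j + 1:] = Q[:, j] @ W[:, j + 1:]
        W[:, j + 1:] -= np.outer(Q[:, j], R[j, j + 1:])
    return Q, R, perm

# Usage: approximate a matrix of exact rank 3 and report the residual.
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3)) @ rng.standard_normal((3, 6))
Q, R, perm = qrcp_low_rank(A, 3)
print(np.linalg.norm(A[:, perm] - Q @ R))
```

The pivot search in every step needs the norms of all remaining columns, which is the kind of per-step dependency that makes column pivoting harder to parallelize than an unpivoted QR factorization.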

Directory of Open Access Journals

The University of Manchester - Institutional Repository

Mixed-precision Cholesky QR factorization and its case studies on multicore CPU with multiple GPUs

Author: Dongarra Jack
Tomov Stanimire
Yamazaki Ichitaro
Publication venue: 'Society for Industrial & Applied Mathematics (SIAM)'
Publication date: 12/05/2015

The University of Manchester - Institutional Repository

An Experimental Study of Two-Level Schwarz Domain Decomposition Preconditioners on GPUs

Author: Heinlein Alexander
Rajamanickam Sivasankaran
Yamazaki Ichitaro
Publication date: 10/04/2023

The generalized Dryja-Smith-Widlund (GDSW) preconditioner is a two-level overlapping Schwarz domain decomposition (DD) preconditioner that couples a classical one-level overlapping Schwarz preconditioner with an energy-minimizing coarse space. When used to accelerate the convergence rate of Krylov subspace iterative methods, the GDSW preconditioner provides robustness and scalability for the solution of sparse linear systems arising from the discretization of a wide range of partial differential equations. In this paper, we present FROSch (Fast and Robust Schwarz), a domain decomposition solver package which implements GDSW-type preconditioners for both CPU and GPU clusters. To improve the solver performance on GPUs, we use a novel decomposition to run multiple MPI processes on each GPU, reducing both the solver's computational and storage costs and potentially improving the convergence rate. This allowed us to obtain competitive or faster performance using GPUs compared to using CPUs alone. We demonstrate the performance of FROSch on the Summit supercomputer with NVIDIA V100 GPUs, where we used NVIDIA Multi-Process Service (MPS) to implement our decomposition strategy. The solver has a wide variety of algorithmic and implementation choices, which poses both opportunities and challenges for its GPU implementation. We conduct a thorough experimental study with different solver options including the exact or inexact solution of the local overlapping subdomain problems on a GPU. We also discuss the effect of using the iterative variant of the incomplete LU factorization and sparse-triangular solve as the approximate local solver, and using lower precision for computing the whole FROSch preconditioner. Overall, the solve time was reduced by factors of about 2× using GPUs, while the GPU acceleration of the numerical setup time depends on the solver options and the local matrix sizes. Comment: Accepted for publication in IPDPS'2

arXiv.org e-Print Archive

Partitioning, Ordering, and Load Balancing in a Hierarchically Parallel Hybrid Linear Solver

Author: Li Xiaoye
Rouet François-Henry
Uçar Bora
Yamazaki Ichitaro
Publication venue: HAL CCSD
Publication date: 01/11/2011

Institut National Polytechnique de Toulouse, RT-APO-12-2. PDSLin is a general-purpose algebraic parallel hybrid (direct/iterative) linear solver based on the Schur complement method. The most challenging step of the solver is the computation of a preconditioner based on an approximate global Schur complement. We investigate two combinatorial problems to enhance PDSLin's performance at this step. The first is a multi-constraint partitioning problem to balance the workload while computing the preconditioner in parallel. For this, we describe and evaluate a number of graph and hypergraph partitioning algorithms to satisfy our particular objective and constraints. The second problem is to reorder the sparse right-hand side vectors to improve the data access locality during the parallel solution of a sparse triangular system with multiple right-hand sides. This is to speed up the process of eliminating the unknowns associated with the interface. We study two reordering techniques: one based on a postordering of the elimination tree and the other based on a hypergraph partitioning. To demonstrate the effect of these techniques on the performance of PDSLin, we present the numerical results of solving large-scale linear systems arising from two applications of our interest: numerical simulations of modeling accelerator cavities and of modeling fusion devices.
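For readers unfamiliar with the Schur complement method named in the abstract above, the following is a small dense NumPy sketch of the underlying block elimination. It is not PDSLin's algorithm: the abstract describes a hybrid solver that builds a preconditioner from an approximate Schur complement, whereas this example forms the exact Schur complement and solves it directly. The function name `schur_solve` and the ordering of interior unknowns first and interface unknowns last are assumptions made for illustration.

```python
import numpy as np

def schur_solve(A, b, n_int):
    """Solve A x = b by eliminating the first n_int (interior) unknowns.

    Hypothetical dense illustration; assumes the interior block A11 and the
    Schur complement S are both nonsingular.
    """
    A11, A12 = A[:n_int, :n_int], A[:n_int, n_int:]
    A21, A22 = A[n_int:, :n_int], A[n_int:, n_int:]
    b1, b2 = b[:n_int], b[n_int:]

    # One interior solve with multiple right-hand sides: [A12 | b1].
    X = np.linalg.solve(A11, np.column_stack([A12, b1]))
    W, y = X[:, :-1], X[:, -1]        # W = A11^{-1} A12, y = A11^{-1} b1

    # Schur complement on the interface unknowns.
    S = A22 - A21 @ W
    x2 = np.linalg.solve(S, b2 - A21 @ y)

    # Back-substitute to recover the interior unknowns.
    x1 = y - W @ x2
    return np.concatenate([x1, x2])

# Usage: compare against a direct solve of the full system.
rng = np.random.default_rng(2)
A = rng.standard_normal((7, 7)) + 7 * np.eye(7)
b = rng.standard_normal(7)
x = schur_solve(A, b, n_int=4)
print(np.linalg.norm(A @ x - b))
```

The interior solve with the columns of A12 as right-hand sides corresponds to the triangular solves with multiple sparse right-hand sides whose ordering the abstract discusses.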

HAL-ENS-LYON

CiteSeerX

Scientific Publications of the University of Toulouse II Le Mirail

INRIA a CCSD electronic archive server

Hal-Diderot

Exploiting Block Structures of KKT Matrices for Efficient Solution of Convex Optimization Problems

Author: Dongarra Jack
Iqbal Zafar
Nooshabadi Saeid
Tomov Stanimire
Yamazaki Ichitaro
Publication venue: Digital Commons @ Michigan Tech
Publication date: 01/01/2021

Convex optimization solvers are widely used in embedded systems that require sophisticated optimization algorithms, including model predictive control (MPC). In this paper, we aim to reduce the online solve time of such convex optimization solvers so as to reduce the total runtime of the algorithm and make it suitable for real-time convex optimization. We exploit the property of the Karush–Kuhn–Tucker (KKT) matrix involved in the solution of the problem that only some parts of the matrix change during the solution iterations of the algorithm. Our results show that the proposed method can effectively reduce the runtime of the solvers.

Michigan Technological University

Directory of Open Access Journals

Performance of random sampling for computing low-rank approximations of a dense matrix on GPUs

Author: Dongarra Jack
Kurzak Jakub
Luszczek Piotr
Mary Théo
Tomov Stanimire
Yamazaki Ichitaro
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 15/11/2015

A low-rank approximation of a dense matrix plays an important role in many applications. To compute such an approximation, a common approach uses the QR factorization with column pivoting (QRCP). Though the reliability and efficiency of QRCP have been demonstrated, this deterministic approach requires costly communication at each step of the factorization. Since such communication is becoming increasingly expensive on modern computers, an alternative approach based on random sampling, which can be implemented using communication-optimal kernels, is becoming attractive. To study its potential, in this paper, we compare the performance of random sampling with that of QRCP on an NVIDIA Kepler GPU. Our performance results demonstrate that random sampling can be up to 12.8× faster than the deterministic approach for computing the approximation of the same accuracy. We also present the parallel scaling of the random sampling over multiple GPUs on a single compute node, showing a speedup of 3.8× over three Kepler GPUs. These results demonstrate the potential of the random sampling as an excellent computational tool for many applications, and its potential is likely to grow on the emerging computers with the increasing communication costs.
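The random-sampling approach mentioned in the abstract above can be sketched with a generic randomized range finder, shown here in NumPy. This is an assumed textbook variant (Gaussian sampling of the column space followed by an unpivoted QR), not necessarily the exact algorithm or GPU kernels evaluated in the paper; `random_sampling_low_rank`, `oversample`, and `seed` are hypothetical names chosen for the example.

```python
import numpy as np

def random_sampling_low_rank(A, k, oversample=5, seed=0):
    """Return Q, B with A ~= Q @ B, where Q has k + oversample orthonormal columns.

    Hypothetical illustration of a randomized range finder.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Sample the column space of A with a Gaussian test matrix (one matrix product).
    Omega = rng.standard_normal((n, k + oversample))
    Y = A @ Omega
    # Orthonormalize the samples; no pivoting is needed for this step.
    Q, _ = np.linalg.qr(Y)
    # Project A onto the sampled subspace.
    B = Q.T @ A
    return Q, B

# Usage: approximate a matrix of exact rank 3 and report the residual.
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 40))
Q, B = random_sampling_low_rank(A, k=3)
print(np.linalg.norm(A - Q @ B))
```

Almost all of the work is in two matrix-matrix products and one unpivoted QR factorization, which is why this style of algorithm avoids the per-step communication that QRCP requires.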

Crossref

Mixed-Precision Cholesky QR Factorization and Its Case Studies on Multicore CPU with Multiple GPUs

Author: Cuthill E.
Ichitaro Yamazaki
Jack Dongarra
Kielbasiński A.
Stanimire Tomov
Yamazaki I.
Publication venue: 'Society for Industrial & Applied Mathematics (SIAM)'

Crossref

Sparse direct solvers with accelerators over DAG runtimes

Author: Dongarra Jack
Faverge Mathieu
Ichitaro Yamazaki
Lacoste Xavier
Ramet Pierre
Publication venue: HAL CCSD
Publication date: 01/01/2012

The current trend in high-performance computing shows a dramatic increase in the number of cores on shared-memory compute nodes. Algorithms, especially those related to linear algebra, need to be adapted to these new computer architectures in order to be efficient. PASTIX is a sparse parallel direct solver that incorporates a dynamic scheduler for strongly hierarchical modern architectures. In this paper, we study the replacement of this internal, highly integrated scheduling strategy by two generic runtime frameworks: DAGUE and STARPU. Those runtimes give the opportunity to execute the factorization task graph on emerging computers equipped with accelerators. As in previous work done in dense linear algebra, we present the kernels used for GPU computations, inspired by the MAGMA library, and the DAG algorithm used with those two runtimes. A comparative study of the performance of the supernodal solver with the three different schedulers is performed on manycore architectures, and the improvements obtained with accelerators are presented with the STARPU runtime. These results demonstrate that these DAG runtimes provide uniform programming interfaces to obtain high performance on different architectures for irregular problems such as sparse direct factorizations.

INRIA a CCSD electronic archive server


Preparing sparse solvers for exascale computing.

Author: Anzt Hartwig
Boman Erik
Curfman McInnes Lois
Falgout Rob
Ghysels Pieter
Heroux Michael
Li Xiaoye
Meier Yang Ulrike
Rajamanickam Sivasankaran
Rupp Karl
Smith Barry
Tran Mills Richard
Yamazaki Ichitaro
Publication venue: eScholarship, University of California
Publication date: 01/03/2020

Sparse solvers provide essential functionality for a wide variety of scientific applications. Highly parallel sparse solvers are essential for continuing advances in high-fidelity, multi-physics and multi-scale simulations, especially as we target exascale platforms. This paper describes the challenges, strategies and progress of the US Department of Energy Exascale Computing Project towards providing sparse solvers for exascale computing platforms. We address the demands of systems with thousands of high-performance node devices where exposing concurrency, hiding latency and creating alternative algorithms become essential. The efforts described here are works in progress, highlighting current successes and upcoming challenges. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'.

eScholarship - University of California